Introduction to Triton Programming: The Transition from Threads to Program Instances

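In Triton, the basic unit of execution shifts from the scalar CUDA thread to the program instance. A program instance is an abstraction over a GPU thread block: a single instance processes a whole vectorized "block" of elements at once.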

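Each execution unit obtains its identity with pid = tl.program_id(axis=0). Picture a warehouse forklift (the program instance) moving a pallet (the block) of 128 boxes, whereas a single worker (a CUDA thread) carries only one box.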
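To make the pallet picture concrete, here is a minimal plain-Python sketch of how a pid could be turned into the element indices one instance handles. It runs on the CPU and is not Triton code. The helper name block_offsets is hypothetical, and BLOCK_SIZE = 128 simply mirrors the 128 boxes in the analogy. In a real kernel the same arithmetic would typically be written as pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE).

```python
BLOCK_SIZE = 128  # boxes per pallet, matching the forklift analogy


def block_offsets(pid, block_size=BLOCK_SIZE):
    """Element indices handled by program instance `pid`.

    Hypothetical plain-Python stand-in for
    `pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)`.
    """
    start = pid * block_size
    return list(range(start, start + block_size))


# Instance 0 takes elements 0..127, instance 1 takes 128..255, and so on.
first = block_offsets(0)
second = block_offsets(1)
print(first[0], first[-1])    # 0 127
print(second[0], second[-1])  # 128 255
```

Each instance owns a disjoint, contiguous range, so no two forklifts pick up the same box.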

Understanding this semantic difference is essential for memory management:

PyTorch view
A Python object that points to contiguous global memory.

Triton view
A 1D or 2D block of data held in registers managed by the compiler.
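The contrast between the two views can be sketched in plain Python. Everything here is a hypothetical stand-in: HostTensor is not torch.Tensor, and load_block / store_block only imitate the role that block loads and stores play in a kernel. The point is that the host-side object carries metadata and a reference to memory, while a loaded block is just a set of values that does not affect memory until it is stored back.

```python
from dataclasses import dataclass


@dataclass
class HostTensor:
    """Hypothetical stand-in for the host-side handle (not torch.Tensor)."""
    storage: list   # plays the role of global memory
    shape: tuple    # metadata kept with the Python object
    strides: tuple  # metadata kept with the Python object


def load_block(storage, offsets):
    """Stand-in for a block load: copies values out of memory."""
    return [storage[o] for o in offsets]


def store_block(storage, offsets, block):
    """Stand-in for a block store: writes values back to memory."""
    for o, v in zip(offsets, block):
        storage[o] = v


t = HostTensor(storage=[1.0, 2.0, 3.0, 4.0], shape=(4,), strides=(1,))
offsets = [0, 1, 2, 3]

block = load_block(t.storage, offsets)  # the "Triton view": just values
block = [2 * v for v in block]          # compute on the block
unchanged = list(t.storage)             # memory is untouched so far
store_block(t.storage, offsets, block)  # only now does memory change
print(unchanged, t.storage)
```

Under these assumptions, the block behaves like a local working copy: shape and stride bookkeeping stays with the host object, and the kernel side only ever sees offsets and values.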

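Triton follows the Single Program, Multiple Data (SPMD) model. Every program instance executes exactly the same code. Instances diverge only where the logic uses pid to compute instance-specific memory offsets.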
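The SPMD idea can be simulated on the CPU with a short plain-Python sketch. This is not Triton code: the launch helper is hypothetical and runs the instances one after another, whereas a GPU would run them in parallel. BLOCK_SIZE is set to 4 only to keep the output readable. The mask plays the role that the mask= argument plays in tl.load and tl.store, guarding the last, partially filled block.

```python
import math

BLOCK_SIZE = 4  # deliberately tiny so the result is easy to inspect


def add_kernel(pid, x, y, out, n_elements, block_size):
    """Kernel body; every simulated program instance runs exactly this code."""
    # The only place where instances differ: pid selects the block of offsets.
    offsets = [pid * block_size + i for i in range(block_size)]
    # Guard the last block, which may run past the end of the data.
    mask = [off < n_elements for off in offsets]
    for off, keep in zip(offsets, mask):
        if keep:
            out[off] = x[off] + y[off]


def broken_kernel(pid, x, y, out, n_elements, block_size):
    """Same body, but pid is ignored: every instance hits the first block."""
    offsets = list(range(block_size))
    for off in offsets:
        if off < n_elements:
            out[off] = x[off] + y[off]


def launch(kernel, x, y, block_size=BLOCK_SIZE):
    """Hypothetical launcher: one instance per block, run sequentially."""
    n = len(x)
    out = [0] * n
    grid = math.ceil(n / block_size)  # number of program instances
    for pid in range(grid):
        kernel(pid, x, y, out, n, block_size)
    return out


x = list(range(10))
y = [10 * v for v in x]
result = launch(add_kernel, x, y)
broken = launch(broken_kernel, x, y)
print(result)  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
print(broken)  # [0, 11, 22, 33, 0, 0, 0, 0, 0, 0]
```

With 10 elements and a block size of 4, three instances are launched and the third one is masked down to two elements. In the broken variant all three instances rewrite the same first block and the rest of the output is never touched, which is what happens when the pid-based offset logic is left out.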

QUESTION 1

What is the primary identifier for a Triton execution unit?

threadIdx.x

tl.program_id(axis=0)

tl.block_idx()

torch.get_id()

QUESTION 2

True or False: A Triton tensor is a Python object that stores metadata like strides on the host CPU.

True

False

QUESTION 3

What is the result of 'forgetting that all program instances execute the same kernel body'?

The compiler will automatically distribute tasks.

Race conditions or overwriting memory if pid-based logic is missing.

The kernel will fail to compile due to a syntax error.

Execution time will double.

QUESTION 4

In the forklift analogy, what does the 'Aisle Number' represent?

The BLOCK_SIZE

The program_id (pid)

The GPU Driver version

The Pointer address

QUESTION 5

Why is the Triton model considered 'Vectorized' compared to CUDA?

It uses Python lists.

One Program Instance handles a block of elements, not just one scalar element.

It only works with 2D matrices.

It runs on the CPU's SIMD units.